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Abstract The principal component analysis (PCA) is a frequently used machine learning method. 
In this paper, the PCA operation is explained by examples with Python program illustration. A proof 
of the diagonalizability of real symmetric matrix is also included, which may help to understand the 
mathematics behind PCA. 


The principal component analysis (PCA) performs linear combinations of the variables to obtain the 
few principal components that are mutually uncorrelated. Most information of the original data is kept 
in the principal components. As a method of data dimensionality reduction, the PCA plays an important 
role in data processing [1, 2]. The principle of PCA is singular value decomposition (SVD), which involves 
quite a few crucial concepts of linear algebra such as eigenvalue, eigenvecotor, basis change, orthogonality, 
symmetric matrix, and matrix multiplication, and so on. The linear algebra is essential to the theory 
and practice of big data, machine learning, algorithms, and programming. In this paper I use examples 
with Python program to explain PCA, step by step, till the mathematical proof of PCA and SVD. It is 
hoped that the interpretation of PCA will open the door of machine learning for the numerous machine 
learning enthusiasts. 


Results 


Euler’s formula and rotation of points in a two-dimensional plane 


The Euler’s formula states that et? = cos@ + isin, where e is the natural logarithmic base, i is the 
imaginary unit, and @ is a real number. Now suppose a point M with coordinate (cos, sin0) within the 
unit circle in a two-dimensional plane. By rotating the M along the unit circle to M’ with coordinate 
(cos(@ + ô), sin(@ + 6)), we have 


ei(9+5) — eil eið 


= (cos0 + isinð)(cosô + isind) 


= cosOcosd + icosésind + isinOcosé + i? sindsind 


= (cos0cosd — sin@sind) + t(cosOsind + sinOcosd) 


Therefore, the coordinate of M’ becomes (cos0cosd — sin@sind, cossin + sinOcosd). For a point 
N(rcos0, rsin), r € R, rotating by ô to N’ will change the coordinate to (r(cosOcosé — sinOsind), 

r(cos@sind + sinOcosd)), which can be written as (rcos6coséd — rsinOsind, rcosOsind + rsinOcosd). Let 
rcos) = a,rsin@ = b, the coordinate of N’ is (acosd — bsind, asind + bcosd). The rotation can be shown 


with matrices. 
cos —sind a | _ | acosd — bsind (1) 
sind cosô b | | asind + bcosd 


-a| cosé | -| d z 


sind cosé 
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or 


[a b] | k | = | acosô — bsind asind + bcosô | (3) 


=a| cos sind |+b| —sind cosô | (4) 


In view of rotation, for Equation 1, the left side rotates point (a, b) by 6 and the right side shows the result 
of operation. From the perspective of basis change, the basis vectors of [a b]T changes from [1 0] 
and [0 1}? to [cosd sinô]! and [—sind cosô]T, respectively, shown in Equation 2. The situation of 
Equation 3 is similar, in which the left side represents rotation and the right side shows what is obtained. 
Equation 4 shows the basis change from [1 0] and [0 1] to[cosd sind] and [—sind cosô], respectively. 


Variance, covariance and rotation 


If we have a dataset Data = { (£1, 1), (2, y2),--: , (£N, yn) }, where zi, yi E€ R, i = 1,2,--- , N, then the 
data can be drawn with a scatter diagram in the rectangular coordinate system. However, the center 
of these points is not necessarily the origin (0,0). Therefore, we get the means of x direction and y 
direction, which are = 5S7 Ui, y= |. yi, respectively, and then subtract them from the original 
data to obtain the centered data Data_new = { (x1 — Z, y1 — Y), (£2 — Z, y2 — Y), +- , (an —Z, yn —9)}, 
designated as Data_new = {(x1, y1), (5, y5),--: , (ay, yn) }. The centered data has the same internal 
structure as the original data. Now we can use the centered data to get the variances of original data, 
which are 02 = D7 x? /(N — 1), 02 = TY y?/(N — 1). 


Now consider a small dataset Data = {(58, —88), (13, 18), (26, —29), (—26, 111), (—6, 55), (—16, 67), 
(103, —138), (67, —104), (20, 10), (18, 36)}, which is shown with gray points in Figure 1. The blue points 
indicates the centered data Data_new. 
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Figure 1: Scatterplot of the datasets 


Data has 10 points, which is roughly distributed along a line in the scatterplot. In the direction of 
this line, the variance of projections is large, whereas that of projections is small in the perpendicular 
direction. Considering the structural characteristics of the data, we can simplify it, that is, simplify the 
two-dimensional data into one-dimensional data by finding the direction with the largest variance, and 
projecting the data points to this direction. In the other (perpendicular) direction, the point values do 
not change much, and they can be discarded. The question is, how to find the direction with the greatest 
variance? 


Let’s start with a general derivation and then look at the specific example. The sample size N is a fixed 
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value here. Thus, we pay our attention to the Szr, Syy, and Sz, of the Data. 


N 
See =a +agt+--+ay =) a? (5) 
1 
N 
Sy =YP tue t tun => uP (6) 
1 
N 
Soy = Ty, HEY +--+ ey = Y aly; (7) 
1 


To find the direction of the greatest variance, we rotate the centered data according to Equation 
3. The data after rotation becomes {(xįcosô — yi sinô, x sind + yi cosô), (xhcos — yhsinð, xhsind + 


ygcosd),-++ , (x'ycoså — yh sind, xy, sind + yycoså)}, which can be substituted to Equations 5-7. 

N 

See X (2icosd — ysin)? (8) 
1 
N 

Soy = X (z;sinð + yjcos6)? (9) 
1 
N 

Siy = X (zicoső — y;sinð)(x;sinð + yicosð) (10) 


1 


Comparing Equations 5-6 and Equations 8-9, we can find that the Sre + Syy keeps unchanged before and 
after rotation. This is easy to understand, as rotation does not change the distance between point and 
origin. It can also be shown by calculating the distance from point to origin directly with the distance 
formula. Now consider Equation 8. To get the extreme values, we need to calculate the first derivative. 


N 
(Sia) = 5 2(x;coså — y;sinð)(x;cosð — y;sinð)' 


1 
N 

= 5 2(x;coså — y;sinð)(—xsinå — y;coså) 
1 


=-2 X (z;cosð — yisind) (xi sind + y;cosô) (11) 
1 


Comparison of Equation 11 and Equation 10 shows that Szy = 0, when (Sss) = 0. Let’s calculate the 
first derivative of Equation 9. 


N 


N 
(Sia = 5 2(aisind + yicosd) (xi sind + ycosô) 
1 
=2 X (z;cosð — y;sinð)(x;sinð + y;cosð) (12) 
1 


It can be shown by comparing Equation 12 and Equation 10 that Syy = 0 when (Syy)! = 0. To get the 
extreme values, we need to calculate the first derivative of Equation 10. 


N 
Se) = —x'.sind — yicosd)(a,sind + yicosd) + (acosdé — y;sind)(x;coså — ysin) 
y 
1 
(13) 
When (Sey)' = 0, 
N N 
X (xi sind + y/cos6)” = X (z!cosô — y sinð)? (14) 
1 1 


w 


When we compare Equations 8, 9, and 14, we can easily see that Syy = Sze at this time. Thus, when 
Szr reaches extreme values, Sz, gets to zero. When Sz, reaches extreme values, Sz, equates Syy. 


Now let’s deal with the solutions of (S+) = 0. Due to the circumferential periodicity of 6, we need 
only consider (—7,7). As the data has been centered, only (—7/2,7/2) needs to be included, in which 
cos #0. According to Equation 11, 


N 
—2 X (z;coső — yisinð)(x;sinð + y;cosô) = 0 
1 
N 


So (2) — y;tanð)(x;tanð + y;) = 0 
1 
N 


se Prand + ay, — xiyitan?d — yi?tand) = 0 
1 


N N 
riyitan?5 — Gr — y! )tanð — 3 xy, =0 
1 1 


tan? ys = i tanô—-1=0 (15) 


The discriminant of Equation 15 


N I2 2 
Asy E E i450 


i: 
I Tyi 


Equation 15 has two unequal roots, tanôı and tans, one of which corresponds to maximum Syys and 
the other corresponds to minimum S,,. According to Vieta’s Theorem, tand,tandg = —1. Within 
(—1/2, 7/2), 6, and d2 are 7/2 apart. 


Briefing Equations 8-15, we find that as 6 changes within (—7/2,7/2), when Ssg reaches maximum, Syy 
gets to minimum and Szy gets to zero; when Syy reaches minimum, Syy gets to maximum and Sz, gets 
to zero; the 6s are 7/2 apart for Sss maximum and minimum. 


Now we consider the solution of Equation 14. 


N N 
So (xitand + yi)? So (z; yjtand)? = 
1 1 
N N 
ee. 2 _ y!2) 
So (ah y; )tan 6+45) aly itand — 2 =0 
1 1 
aly! 
tan?6 4 iS 7p yp tand -1=0 (16) 
1 a a 


The discriminant of Equation 16 


=a ts 244>0 


Equation 16 has two unequal roots, tand{ and tanô,, one of which corresponds to maximum Syy and 
the other corresponds to minimum S,,. According to Vieta’s Theorem, tandjtand, = —1. Within 
(—1/2, 7/2), 64 and 65 are 7/2 apart. 


With the quadratic formula, we can get the roots of Equations 15 and 16. Here I present the Python 
script PCA_example_1.py for doing this job. Also, the script shows the dynamic process of rotating 
during which changes of Sz, Syy, and Sz, can be clearly seen (Figure 2). 
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Figure 2: Changes of Srez, Syy, and S,,during rotation 


Basis change for data matrix 


I have shown how to understand the process of finding the best direction of a dataset by rotation. Next, 
I would like to explain the process with matrix and change of basis. The dataset Data is represented by 


a matrix, with each line for one point. The matrix is then centered to matrix A. 


Data = 


58 

13 

26 
—26 


—88 
18 
—29 


32.3 —81.8 
—12.7 24.2 
0.3 —22.8 
—51.7 117.2 
—column mean —31.7 61.2 
—41.7 = 73.2 = 
77.3 —131.8 
41.3 —97.8 
—5.7 16.2 
—7.7 42.2 


— Sxx/S_yy 
— Syy/S_xx 


Then conduct change of basis, 


32.3 —81.8 
—12.7 24.2 
0.3 —22.8 
—51.7 117.2 
—31.7 61.2 cos sinô 
—41.7 73.2 | —sind cosô | SAV 
77.3 —131.8 
41.3 —97.8 
—5.7 16.2 
—7.7 42.2 


The matrix AV is still 10 x 2. At this time, the Sss, Sry, Syz, and Syy can be represented by matrix 
(AV)TAV. We have (AV)TAV = VTATAV = VT(ATA)V. For a dataset, the A is fixed. Consequently, 
ATA is fixed. However, V is changeable. Let’s caculate ATA. 


14394.1  —28652.6 


—28652.6 59615.6 (17) 


Ata =| 


The ATA is a symmetric matrix, which can be diagonalized. With the eigenvectors and eigenvalues of 
ATA, we get 


ava | 143941 -28652.6 
=| -28652.6 59615.6 
__ | —0.43618794 —0.89985559 | | 73504.40482362 0 —0.43618794 0.89985559 
~ | 0.89985559 —0.43618794 0 505.29517638 —0.89985559 —0.43618794 
= QAQ" (18) 


In Equation 18, the column vectors of Q are eigenvectors of ATA; the diagonal elements in the diagonal 
matrix A are the eigenvalues corresponding to the eigenvectors of ATA; Q is an orthogonal matrix 
satisfying Q~! = QT. Then we have 


(AV)TAV = VTATAV = VT(ATA)V = VT(QAQ®V = (V'Q)A(QTV) (19) 


Now look carefully at the far right of Equation 19. A is a diagonal matrix, which behaves well. Q is 
fixed while V can be changed. Now is the most critical time. If we select V carefully [3], which satisfies 
QTV = I, namely, V = Q, then (QTV)T = VTQ = I. Then Equation 19 becomes 


(AV)TAV =IAI=A (20) 


It looks neat. Therefore, for the dataset that has undergone such a basis change, Sse = A, = A(1,1) = 
73504.40482362, Syy = Ag = A(2,2) = 505.29517638, Szy = Sys = 0. Such results are the same 
as with those of the rotation method that we have done previously (see the output of Python script 
PCA_example_1.py ). 


Since rotation can be regarded as change of basis, we compare the basis via rotation with the basis 
via careful selection of V = Q. For Equation 19, when V = Q, the basis is V and the basis vec- 
tors are [—0.43618794 — 0.8998555] and [0.89985559 — 0.43618794], which are shown with oliver 
axes in Figure 3. Previously, we get the 6 and corresponding basis vectors in the rotation matrix are 
[(0.43618794 0.89985559] and [—0.89985559 + 0.43618794], which is represented by the magenta axes in 
Figure 3. 


200 


100 |- 


—100 }- 


200 : 
—200 —100 0 100 200 


Figure 3: rotation and basis change of dataset 


Obviously, the directions of the basis vectors are opposite. The periodicity of Equation 15 is m. If we 
add a 7 to the 6, then we get the basis vectors the same as those of V. Or, we can change the signs of 
basis vectors in V to get the same basis vectors as those of rotation matrix. Here is the proof. According 
to Equation 18, 


ATA = QAQT 


_ à 0 qt | 
=la | 0 vl 
= Mq qF + A2G293 

= à\ı(—q1)(—q7 ) + à2(—q2)(—q7) 


T 
= | -qı -a ] |% A] (21) 
It should be noted that the rules of block multiplication for matrices is used here. We shall need to 
apply the rule once more when prove the diagonalizability of real symmetric matrices later. Thus, I have 
shown that sign changes of any column vectors in Q, do not change ATA. If signs of column vectors in Q 
are changed, due to V = Q, the basis vectors in V are changed accordingly. However, these operations 
do not change A. 


A slightly more complex example of PCA 


We have known how to find the direction of greatest variance in a two-dimensional plane. The dimension 
of example dataset is only two, which are the simplest. The method of rotation and the method of basis 
change are pretty much the same, in terms of interpretability and computational complexity. However, 
the datasets we meet with in reality are more often than not high-dimensional. For these datasets, it is 
easier for us to understand and operate by method of basis change. Let’s consider a multi-dimensional 
example, which is the Iris dataset from scikit-learn [4]. The sample size is 50, and each individual contains 
4 features. Therefore the original dataset can be represented with a 150x4 matrix. 


The dataset is centered, and then the PCA process in the scikit-learn is used to get the new basis 
V1. We also calculate the new basis V2 according to the method of basis change (see Python script 
PCA_example_2.py). The centered data is in the 4-dimensional coordinate system where each point 
has a coordinate (x,y,z,w). The sum of Sse, Syy, Szz, and Sw» is the same before and after basis 
change. 


Let’s see the results of scikitlearn and basis change. 


0.36138659  0.65658877 —0.58202985 —0.31548719 
—0.08452251 0.73016143 0.59791083 0.3197231 

0.85667061 —0.17337266 0.07623608 0.47983899 

0.3582892 —0.07548102 0.54583143 —0.75365743 


Vi= 


—0.36138659 0.65658877 0.58202985 0.31548719 
0.08452251 0.73016143 —0.59791083 —0.3197231 
—0.85667061 —0.17337266 —0.07623608 —0.47983899 
—0.3582892 —0.07548102 —0.54583143 0.75365743 


V2=0= 


According Equation 21, the signs of column vectors in Q can be changed. After change the signs of 
columns 1, 3, and 4 in V2, V1 = V2’. We go on to compare the singular values from scikit-learn 
and basis change, and the results are the same (see output of PCA_example_2.py). The square of 
singular value is the À in diagonal matrix A. In this example, 41 = 630.0080142, Az = 36.15794144, 
Ag = 11.65321551, A4 = 3.55142885; Di Ai = 681.3706000000001 and Ay + Ag account for 0.977. This 
means that most of the variance is contributed by A; and Az, which correspond to the 1st and 2nd columns 
in the data after basis change. Therefore, we can keep the first two and discard the rest columns in the 
data after basis change, which simplifies the data while retains most of the information in original data. 


Till now we have two examples, in which we look for the direction of the greatest variance, the direction of 
the second greatest variance, and so on. This process is PCA. PCA is an unsupervised data dimensionality 
reduction method, using SVD to perform basis change for the data. The variance on the first basis vector 
of the new basis is the largest; the variance on the second basis vector of the new basis is the second 
largest, and so on. Meanwhile, according to the eigenvalues, we can decide which columns in the data 
after basis change are to be kept so as to achieve data dimensionality reduction. 


Real symmetric matrices are diagonalizable 


The diagonalizability of real symmetric matrices is the foundation of PCA by SVD. We need to under- 
stand the diagonalizability before we are able to use PCA confidently. The proof is divided into these 
steps. (1) A real symmetric matrix of order n has n eigenvalues (including repeated eigenvalues); (2) All 
eigenvalues of a real symmetric matrix are real numbers; (3) Eigenvectors with unequal eigenvalues are 
orthogonal, (4) Eigenvectors with equal eigenvalues are also orthogonal. 


For the above four steps, (1) is guaranteed by the Fundamental Theorem of Algebra; (2) and (3) can 
refer to Gilbert Strang’s Introduction to Linear Algebra [5]. The more troublesome thing is the proof of 
(4). Two proofs are provided in the book [5]. One uses Schur’s theorem and the proof given has been 
detailed. The other is to solve the eigenvalues and eigenvectors one by one for a real symmetric matrix, 
which has not been elaborated on therein. I will go slowly here and make the proof understood once for 
all. 


Before going, we have (i) Any n-dimensional nonzero vector can start from it to construct an orthogonal 
matrix of order n; (ii) The product of two orthogonal matrices is also an orthogonal matrix. For (i), we 
can do easily with Gram-Schmdit. (ii) is also evident. 


(Q1Q2)7(Q1Q2) = QF QTQ1Q2 = QFIQ2 = QF Q2 =I 


Now let’s start. Suppose Sn is a real symmetric matrix of order n. The solution equation for its 
eigenvalues is a univariate nth degree equation with n real roots, possibly including repeated roots. 
From these n real roots, we can find the maximum root, A, and get the corresponding eigenvecotor qf. 
By (i) previously, we can start from gj to construct an orthogonal matrix of order n, [q]? qf --- q2], 
designated as Qı. The g? is orthogonal to q?,q%,--- ,q™, and (Q7S,Q1)* = QTSTQTT = QT SnQ1. 


We have 


a, 
q? 
QT SaQI = . Srila? 3 q% | 
qn)" 
a 
q3 
= i [ Sna? Sna} Sndn | 
qn) 
gf) Sng? (qf)'Snqk =~ (ef) Sna® 
| (3) Sna? (a3) Saaz >o (G2) Snan 
qn) Sna? (qe) 5,08 (az) Snar 
at) Mat (at)"Snaz o (aT Snae 
O | (3a? (a3) Sna3 o (a3) Snan 
aza? (q@)"Sna® o (q®)TSnam 
à (aP) Sng? o (aT Snare 
_ 0 (43) Sng Ae q3)" Sng? 
A o De iE oiea Opec 
| 01 GE) Sn ae q3) Snan 
0 | (a2) Sna3 qn)" Snqn 
à` 0 
-| : a | (22) 


In Equation 22, Sn—1 is a real symmetric matrix. Now the right side of Equation 22 has A; in its diagonal. 
We still have other eigenvalues that need to be placed on the diagonal. The advantage now is that Sn—1 
remains a real symmetric matrix, but with one less order than the original, only order n-1. For S,_1, 
the same method can be applied to find its maximum eigenvalue and corresponding eigenvector. In this 
case the eigenvector of Sn—1 is n-1 dimensional. Therefore, the (Q')T and Q} t, which are placed on 
both sides of Sp—1, are orthogonal matrices of order n-1. To keep the dimension consistent with Sn, we 


need to put Oo into the nth order matrix. Here we use rules of block multiplication for matrices, 


211518] -[5 4 ; 


Equation 23 shows that the multiplication can be regarded as A acts on B. Therefore, a new round of 
operation upon Equation 22 is 


AO 
Q2Q7TSnQiQ2= | 0A 0 (24) 
0 i 0 | Sn—2 
where 
1' 0 0 0 
lta a eae 


Obviously, Qə is an orthogonal matrix. The maximum eigenvalue of Sn—1 is Ag, and the corresponding 
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n-1 


eigenvector is q}  - Continue as described above, we have 


Qla QIQOT OnI? Qna = 
Sı 


An 
Due to (i) previously, the left side of Equation 25 can be rearranged. 


Tg Qe QT on QiQa*Qa-1 = (QE e QTI) Sn (QIQ Qna) 
= (Q1Q2--- Qn—1)" Sn(Q1Q2°-*Qn-1) (26) 


For the right side of Equation 26, let QiQ2---Qn-1 = Q, then Q is an orthogonal matrix. The Equation 
26 can be written as 


1° QIQOT SQQ Qn- = QnQ (27) 
Substitute into Equation 25, we have 
Al 
A2 
Q75,Q = 
An 
SQ = QA (28) 
Sn = QAQr (29) 


Equation 28 proves the column vectors of Q are eigenvectors of Sn. The number of eigenvectors is n and 
eigenvectors are orthogonal to each other. Equation 29 proves real matrices are diagonalizable. E 


Now we can better understand PCA by SVD. The standardized data with mean zero and variance one 
is stored in a matrix A. The shape of A is m x n, where m is sample size and n is data dimension. Then 
ATA is a real symmetric matrix. By Equation 29, ATA = QAQT. Let V = Q, and conduct basis change 
to A with V. Then we have 
(AV)T(AV) = VTATAV 

= VT(ATA)V 

= V"(QAQ")V 

= (VTQ)A(QTV) 

= (QTQ)A(QTQ) 

=A (30) 


10 


If all elements on diagonal of A in Equation 30 are positive, then 


AV =[Avy Av2-:: Xun] (31) 
VA 
z 42 Av2 Ata VA2 
VAL Vr2 VAn 
VAn 
ai 
02 
=[uy uz: un (32) 
On 
=U% (33) 
A=Uxavt (34) 
Obviously, the column vectors in the right side of Equation 31 are orthogonal to each other. Also, the 
column vectors in U in Equation 33 are orthogonal to each other and oj, 02, +++, On are called sigular 
values. Equation 33 is PCA and Equation 34 is SVD. | 
Discussion 


Linear algebra is an important mathematical foundation of machine learning, and PCA is not only one of 
the common algorithms of machine learning, but also an important content of linear algebra. Compared 
with other materials that introduce PCA, this paper explains PCA from easy to difficult, by examples, 
and with numbers and shapes combined. The writing is easy to understand and the requirements for 
mathematical background are few. It would be my pleasure if you could lay a solid foundation and feel 
fun for machine learning. 


Methods 


Two Python scripts are provided to help understand PCA. The Pytohn version is 3.10.4 and the packages 
needed are numpy 1.21.6 [6], matplotlib 3.5.2 [7], and sklearn 0.0 [4]. The appendix of reference [3] 
includes some of the basics of linear algebra that are important for understanding PCA by SVD. 


Code availability 


The Python scripts are available at https://github.com/shiqiang-lin/understanding_pca. 
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